make `memorynew` intrinsic #55913

oscardssmith · 2024-09-28T05:08:09Z

This speeds up making new Memorys and allow the compiler to better understand what's going on, allowing for LLVM level escape analysis in some cases. There is more room to grow this (currently this only optimizes for fairly small Memory since bigger ones would require writing some more LLVM code, and we probably want a size limit on putting Memory on the stack to avoid stackoverflow. For larger ones, we could potentially inline the free so the Memory doesn't have to be swept by the GC, etc.

Benchmarks:

julia> f(T) = Memory{T}(undef, 2);

julia> @btime f(Int);
  7.879 ns (1 allocation: 48 bytes) #before
  3.959 ns (1 allocation: 48 bytes) #after 

julia> @btime f(Union{Int, Nothing})
  8.426 ns (1 allocation: 48 bytes) #before
  4.017 ns (1 allocation: 48 bytes) #after 

julia> @btime f(Any);
  8.637 ns (1 allocation: 48 bytes) #before
  3.920 ns (1 allocation: 48 bytes) #after 

julia> function g()
           m = Memory{Int}(undef, 2)
           for i in 1:2
              m[i] = i
           end
           m[1]+m[2]
       end

julia> @btime g()
  9.735 ns (1 allocation: 48 bytes) #before
  1.719 ns (0 allocations: 0 bytes) #after

src/cgutils.cpp

oscardssmith · 2024-10-04T03:54:41Z

@gbaraldi so with LLVM assertions enabled I'm getting

julia: /home/oscardssmith/Documents/Code/julia/deps/srccache/llvm-julia-18.1.7-2/llvm/lib/IR/Instructions.cpp:685: void llvm::CallInst::init(llvm::FunctionType*, llvm::Value*, llvm::ArrayRef<llvm::Value*>, llvm::ArrayRef<llvm::OperandBundleDefT<llvm::Value*> >, const llvm::Twine&): Assertion `(i >= FTy->getNumParams() || FTy->getParamType(i) == Args[i]->getType()) && "Calling a function with a bad signature!"' failed.

[398134] signal 6 (-6): Aborted
in expression starting at REPL[3]:1
pthread_kill at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
raise at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
abort at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x7108b6c2871a)
__assert_fail at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
_ZN4llvm8CallInst4initEPNS_12FunctionTypeEPNS_5ValueENS_8ArrayRefIS4_EENS5_INS_17OperandBundleDefTIS4_EEEERKNS_5TwineE at /home/oscardssmith/Documents/Code/julia/usr/bin/../lib/libLLVM.so.18.1jl (unknown line)
CallInst at /home/oscardssmith/Documents/Code/julia/usr/include/llvm/IR/Instructions.h:1692 [inlined]
Create at /home/oscardssmith/Documents/Code/julia/usr/include/llvm/IR/Instructions.h:1540 [inlined]
CreateCall at /home/oscardssmith/Documents/Code/julia/usr/include/llvm/IR/IRBuilder.h:2398
CreateCall at /home/oscardssmith/Documents/Code/julia/usr/include/llvm/IR/IRBuilder.h:2420 [inlined]
emit_memorynew at /home/oscardssmith/Documents/Code/julia/src/cgutils.cpp:4555 [inlined]

which is on the line that does ctx.builder.CreateCall(prepare_call(jl_alloc_genericmemory_unchecked_func), {ptls, nbytes, cg_typ});. Any ideas?

gbaraldi · 2024-10-04T12:07:56Z

I'd print everyone involved here with the way I showed you yesterday thing->print(dbgs()), So print the func and the arguments

oscardssmith · 2024-10-04T20:40:44Z

This now works! For simple examples like Memory{Int}(undef, 3) it's about 7ns compared to 11 previously. Also this version is almost able to be analyzed by our escape analysis to delete the Memory when it doesn't (we just need to teach the llvm-alloc-helpers about gc_loaded, and for non @inbounds accesses we need to teach LLVM that the boundschecks are dead).

gbaraldi · 2024-10-04T20:49:49Z

As an example of what is possible.

Allocopt was able to go from

define i64 @julia_f_769() #0 !dbg !5 {
top:
  %pgcstack = call ptr @julia.get_pgcstack()
  %current_task1 = getelementptr inbounds i8, ptr %pgcstack, i64 -112, !dbg !14
  %memoryref_mem = call dereferenceable(40) ptr addrspace(10) @julia.gc_alloc_obj(ptr nonnull %current_task1, i64 40, ptr addrspace(10) addrspacecast (ptr @"+Core.GenericMemory#771.jit" to ptr addrspace(10))), !dbg !14
  %0 = addrspacecast ptr addrspace(10) %memoryref_mem to ptr addrspace(11), !dbg !14
  %1 = getelementptr inbounds { i64, ptr }, ptr addrspace(11) %0, i64 0, i32 1, !dbg !14
  %2 = call nonnull ptr @julia.pointer_from_objref(ptr addrspace(11) %0) #4, !dbg !14
  %3 = getelementptr inbounds i8, ptr %2, i64 16, !dbg !14
  store ptr %3, ptr addrspace(11) %1, align 8, !dbg !14
  store i64 3, ptr addrspace(11) %0, align 8, !dbg !14
  %memoryref_data4 = call ptr addrspace(13) @julia.gc_loaded(ptr addrspace(10) %memoryref_mem, ptr %3), !dbg !15
  store i64 2, ptr addrspace(13) %memoryref_data4, align 8, !dbg !15, !tbaa !20, !alias.scope !24, !noalias !27
  %memoryref_data11 = getelementptr inbounds i8, ptr addrspace(13) %memoryref_data4, i64 8, !dbg !32
  store i64 4, ptr addrspace(13) %memoryref_data11, align 8, !dbg !32, !tbaa !20, !alias.scope !24, !noalias !27
  %memoryref_data18 = getelementptr inbounds i8, ptr addrspace(13) %memoryref_data4, i64 16, !dbg !34
  store i64 5, ptr addrspace(13) %memoryref_data18, align 8, !dbg !34, !tbaa !20, !alias.scope !24, !noalias !27
  ret i64 11, !dbg !36
}

to. Removing the allocation. Which likely would allow it to just return the 11

define i64 @julia_f_769() #0 !dbg !5 {
top:
  %memoryref_mem = alloca [40 x i8], align 16
  %pgcstack = call ptr @julia.get_pgcstack()
  %current_task1 = getelementptr inbounds i8, ptr %pgcstack, i64 -112, !dbg !14
  call void @llvm.lifetime.start.p0(i64 40, ptr %memoryref_mem)
  %0 = freeze [40 x i8] undef, !dbg !14
  store [40 x i8] %0, ptr %memoryref_mem, align 1, !dbg !14
  %1 = getelementptr inbounds { i64, ptr }, ptr %memoryref_mem, i64 0, i32 1, !dbg !14
  %2 = getelementptr inbounds i8, ptr %memoryref_mem, i64 16, !dbg !14
  store ptr %2, ptr %1, align 8, !dbg !14
  store i64 3, ptr %memoryref_mem, align 8, !dbg !14
  %memoryref_data4 = call ptr addrspace(13) @julia.gc_loaded(ptr addrspace(10) null, ptr %2), !dbg !15
  store i64 2, ptr addrspace(13) %memoryref_data4, align 8, !dbg !15, !tbaa !20, !alias.scope !24, !noalias !27
  %memoryref_data11 = getelementptr inbounds i8, ptr addrspace(13) %memoryref_data4, i64 8, !dbg !32
  store i64 4, ptr addrspace(13) %memoryref_data11, align 8, !dbg !32, !tbaa !20, !alias.scope !24, !noalias !27
  %memoryref_data18 = getelementptr inbounds i8, ptr addrspace(13) %memoryref_data4, i64 16, !dbg !34
  store i64 5, ptr addrspace(13) %memoryref_data18, align 8, !dbg !34, !tbaa !20, !alias.scope !24, !noalias !27
  ret i64 11, !dbg !36
}

oscardssmith · 2024-10-04T20:55:28Z

julia> @btime Memory{Int8}(undef, 3)
  7.169 ns (1 allocation: 32 bytes) # before
  3.986 ns (1 allocation: 32 bytes) # after

giordano · 2024-10-04T20:56:11Z

FixedSizeArrays.jl is going to be much nicer 🙂

src/cgutils.cpp

@inbounds

combined with #55913, the compiler is smart enough to fully remove ``` function f() m = Memory{Int}(undef, 3) @inbounds m[1] = 2 @inbounds m[2] = 2 @inbounds m[3] = 4 @inbounds return m[1] + m[2] + m[3] end ```

src/ccall.cpp

@inbounds

combined with #55913, the compiler is smart enough to fully remove ``` function f() m = Memory{Int}(undef, 3) @inbounds m[1] = 2 @inbounds m[2] = 2 @inbounds m[3] = 4 @inbounds return m[1] + m[2] + m[3] end ```

giordano · 2024-10-11T19:56:13Z

Can you please add an llvm pass test for #56030 (comment) (removing all memory for a simple case where the Memory object doesn't escape)?

oscardssmith · 2024-10-11T20:01:18Z

Do you want an actual LLVM pass, or can I just write a test for 0 allocations?

giordano · 2024-10-11T20:04:19Z

I think an llvm test would be more robust, but probably a simple zero-allocation test would do the job as well.

oscardssmith · 2024-10-11T21:42:42Z

LOL. This test is so good it broke a doctest in performance tips. We're testing to show that you get allocations if you have "bad" code that allocates arrays, but now it doesn't allocate :laughing

oscardssmith · 2024-10-14T16:45:10Z

This is now on top of #55995 (to figure out why we weren't optimizing correctly), but other than that, I think this is good to go!

giordano · 2024-10-15T01:05:15Z

Maybe a test of no allocations in simple cases as discussed above? 🙂

MasonProtter · 2024-11-01T10:07:20Z

Spooky! 👻 🎃

oscardssmith · 2024-11-04T02:12:45Z

@aviatesk any idea why the effects tests are failing?

aviatesk · 2024-11-06T01:53:07Z

@aviatesk any idea why the effects tests are failing?

I guess this issue has been resolved by implementing the effects modeling for the new builtin?

oscardssmith · 2024-11-06T02:20:55Z

yeah, I hadn't realized how good the effects for this intrinsic are.

oscardssmith · 2024-11-06T02:22:40Z

actually CI thinks stuff is still very broken (which is weird given that the test passed locally)

MasonProtter · 2024-11-06T10:53:56Z

So currently this appears to only allow for stack allocation when the size of the Memory is statically known at compile time within the function body. Is there hope for this to happen with dynamically sized Memory, or is that significantly harder to do here?

nsajko · 2024-11-06T11:53:03Z

But stack size is limited. Stack-allocating when the size isn't known at compile time seems dangerous?

MasonProtter · 2024-11-06T11:58:31Z

Could have runtime checks on the size.

oscardssmith · 2024-11-06T12:33:55Z

the dynamic version would be great, but would add an extra ~100 lines of coffee since you would have to make the basic blocks to handle both sides of the allocation

MasonProtter · 2024-11-06T12:47:47Z

Mhm. What I do with the SlabBuffer in Bumper.jl is at the point of allocation, I do something like

if fits_in_slab(sz, slab)
    p = alloca_like_path(sz, slab)
else
    p = @noinline malloc_like_path(sz)
end

with the cleanup phase at the end of the region then either resetting the slab, or freeing the malloc. The more I write this though the more I realize how out of scope this is for this PR so maybe we should discuss elsewhere.

gbaraldi · 2024-11-06T14:04:34Z

So we can probably do something better if we can prove that the dynamically sized array doesn't escape. https://github.com/JuliaLang/julia/pull/52382/files started some work on this. We could for example have it be malloc, it being a stack allocation is difficult because VLAs are sketchy, so we would need to do something like allocate a small buffer and switch to malloc, the extra branch makes me think it should just be a malloc.

giordano · 2024-11-26T23:23:56Z

The doctest failure

│ Subexpression:
│ 
│ for i in [1,4,0]
│     println(i)
│ end
│ 
│ Evaluated output:
│ 
│ 1
│ 1
│ 24820
│ 
│ Expected output:
│ 
│ 1
│ 4
│ 0

is concerning, since a simple for loop is giving completely wrong results. The doctest failure about allocation disappearing is much more welcome instead.

oscardssmith · 2024-11-26T23:25:37Z

yeah. something in alloc_opt is lying about the lifetime of the memory leading it to be removed spuriously. Gabriel and I are looking into it.

The Value printer LLVM uses just prints the kind of instruction so it just shows call. Update llvm-alloc-opt.cpp

no segfault empty mem optimization optimized! fix zeroinit it works It's alive! use checked rather than overflow fix error path sink ptls debugs fixes fix optimization typo cleanup and test mark some tests broken mark some as broken move boundscheck to julia and improve codegen fix-bootsrap off by 1 update overflow message switch wideint_t to overflowing operations address review Co-authored-by: Jameson Nash <[email protected]> Update src/codegen.cpp Co-authored-by: Jameson Nash <[email protected]> Update src/codegen.cpp Co-authored-by: Jameson Nash <[email protected]> Update src/codegen.cpp Co-authored-by: Jameson Nash <[email protected]> Update src/builtins.c Co-authored-by: Jameson Nash <[email protected]> Update src/codegen.cpp Co-authored-by: Jameson Nash <[email protected]> adress review fix tbaa fix issues add test fix build on clang fix

Co-authored-by: Jeff Bezanson <[email protected]>

…mpiles down to a noop most of the time

oscardssmith commented Sep 28, 2024

View reviewed changes

src/cgutils.cpp Outdated Show resolved Hide resolved

nsajko added the arrays [a, r, r, a, y, s] label Sep 28, 2024

oscardssmith added the performance Must go faster label Sep 29, 2024

oscardssmith force-pushed the os/make-memorynew-tfunc branch from 6222082 to b65a483 Compare October 5, 2024 01:12

oscardssmith commented Oct 5, 2024

View reviewed changes

src/cgutils.cpp Outdated Show resolved Hide resolved

This was referenced Oct 7, 2024

teach llvm-alloc-helpers about gc_loaded #56030

Merged

don't capture arrays/tuples in BoundsError #43738

Open

vtjnash reviewed Oct 8, 2024

View reviewed changes

src/ccall.cpp Outdated Show resolved Hide resolved

oscardssmith force-pushed the os/make-memorynew-tfunc branch from b65a483 to 724b8c5 Compare October 11, 2024 19:51

oscardssmith mentioned this pull request Oct 13, 2024

Constructing a 67108864 TiB array succeeds incorrectly, corrupting memory #54328

Open

giordano mentioned this pull request Oct 13, 2024

Memory stores aren't vectorised in a for loop unless explicit at-inbounds is used JuliaArrays/FixedSizeArrays.jl#70

Open

oscardssmith force-pushed the os/make-memorynew-tfunc branch 3 times, most recently from 2c2b098 to e6e26ab Compare October 14, 2024 16:43

oscardssmith requested a review from vtjnash October 14, 2024 16:45

oscardssmith mentioned this pull request Oct 15, 2024

Make BoundsError lazy and move Memory boundscheck to Julia #56167

Draft

oscardssmith force-pushed the os/make-memorynew-tfunc branch from ff6f042 to 79b6ef2 Compare November 5, 2024 21:15

oscardssmith force-pushed the os/make-memorynew-tfunc branch 2 times, most recently from 103c575 to e86f702 Compare November 26, 2024 19:05

gbaraldi and others added 13 commits November 27, 2024 23:03

Improve remarks of the alloc opt pass slightly.

866d9ce

The Value printer LLVM uses just prints the kind of instruction so it just shows call. Update llvm-alloc-opt.cpp

Update src/cgutils.cpp

667b061

Co-authored-by: Jeff Bezanson <[email protected]>

Update src/cgutils.cpp

966c2bc

Co-authored-by: Jeff Bezanson <[email protected]>

Add compiler fence to length store to avoid it being deleted. This co…

e13c0d7

…mpiles down to a noop most of the time

Make the barrier optional on the size of the memory

08ffa93

re-enable gc_loaded allocopt

46c2291

reenable dynamic len version

d061cbc

fix effects

80727ff

fix CI

7dc32d3

fix grow_end

8e736ab

fix gc_loaded analysis

6773c3a

actually fix?

960e98d

oscardssmith force-pushed the os/make-memorynew-tfunc branch from 6e05045 to 960e98d Compare November 28, 2024 04:03

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

make `memorynew` intrinsic #55913

make `memorynew` intrinsic #55913

oscardssmith commented Sep 28, 2024 •

edited

Loading

oscardssmith commented Oct 4, 2024

gbaraldi commented Oct 4, 2024

oscardssmith commented Oct 4, 2024

gbaraldi commented Oct 4, 2024

oscardssmith commented Oct 4, 2024

giordano commented Oct 4, 2024

giordano commented Oct 11, 2024

oscardssmith commented Oct 11, 2024

giordano commented Oct 11, 2024

oscardssmith commented Oct 11, 2024

oscardssmith commented Oct 14, 2024

giordano commented Oct 15, 2024

MasonProtter commented Nov 1, 2024 •

edited

Loading

oscardssmith commented Nov 4, 2024

aviatesk commented Nov 6, 2024 •

edited

Loading

oscardssmith commented Nov 6, 2024

oscardssmith commented Nov 6, 2024

MasonProtter commented Nov 6, 2024

nsajko commented Nov 6, 2024

MasonProtter commented Nov 6, 2024 •

edited

Loading

oscardssmith commented Nov 6, 2024

MasonProtter commented Nov 6, 2024 •

edited

Loading

gbaraldi commented Nov 6, 2024

giordano commented Nov 26, 2024

oscardssmith commented Nov 26, 2024

make memorynew intrinsic #55913

Are you sure you want to change the base?

make memorynew intrinsic #55913

Conversation

oscardssmith commented Sep 28, 2024 • edited Loading

oscardssmith commented Oct 4, 2024

gbaraldi commented Oct 4, 2024

oscardssmith commented Oct 4, 2024

gbaraldi commented Oct 4, 2024

oscardssmith commented Oct 4, 2024

giordano commented Oct 4, 2024

giordano commented Oct 11, 2024

oscardssmith commented Oct 11, 2024

giordano commented Oct 11, 2024

oscardssmith commented Oct 11, 2024

oscardssmith commented Oct 14, 2024

giordano commented Oct 15, 2024

MasonProtter commented Nov 1, 2024 • edited Loading

oscardssmith commented Nov 4, 2024

aviatesk commented Nov 6, 2024 • edited Loading

oscardssmith commented Nov 6, 2024

oscardssmith commented Nov 6, 2024

MasonProtter commented Nov 6, 2024

nsajko commented Nov 6, 2024

MasonProtter commented Nov 6, 2024 • edited Loading

oscardssmith commented Nov 6, 2024

MasonProtter commented Nov 6, 2024 • edited Loading

gbaraldi commented Nov 6, 2024

giordano commented Nov 26, 2024

oscardssmith commented Nov 26, 2024

make `memorynew` intrinsic #55913

make `memorynew` intrinsic #55913

oscardssmith commented Sep 28, 2024 •

edited

Loading

MasonProtter commented Nov 1, 2024 •

edited

Loading

aviatesk commented Nov 6, 2024 •

edited

Loading

MasonProtter commented Nov 6, 2024 •

edited

Loading

MasonProtter commented Nov 6, 2024 •

edited

Loading